Conversation

@brownbaerchen
Contributor

After figuring out how to do DCTs on GPUs with mpi4py-fft, I refactored a lot. The main goal was to get rid of pySDC's own implementation of the DCT via FFTs and leave that to the FFT libraries. One advantage is that this enables complex-to-complex DCTs, whereas the previous implementation could only do real-to-real. Furthermore, all transform code now lives in the 1D basis implementations and is no longer duplicated in the ND discretization implementation. Overall, this implementation is a lot more straightforward and easier to follow than before.

Another reason for the refactor was going 3D. I had a version of 3D Rayleigh-Benard working with the previous implementation, but only as long as I used the same resolution in each direction. The problem was figuring out the global size of the array with padding when you don't know which axes have been transformed and which are distributed, in particular because there are multiple choices for the distribution in more than 2D.
Since this version makes much more direct use of mpi4py-fft, this bookkeeping is now left to mpi4py-fft. The interface is pretty generic, and I implemented a bunch of tests for the 3D stuff with padding.
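
For readers unfamiliar with that interface, here is a minimal sketch of the kind of setup this refers to, based on the mpi4py-fft documentation rather than the pySDC code; the resolution, DCT type and padding factors are just illustrative assumptions:

import functools
from mpi4py import MPI
import numpy as np
from mpi4py_fft import PFFT, newDistArray
from mpi4py_fft.fftw import dctn, idctn

# FFT along axes 0 and 1, DCT along axis 2, 3/2 dealiasing padding in all
# directions; mpi4py-fft figures out the padded global shapes internally.
dct = functools.partial(dctn, type=3)
idct = functools.partial(idctn, type=3)
N = (32, 32, 32)
fft = PFFT(
    MPI.COMM_WORLD,
    N,
    axes=((0, 1), (2,)),
    dtype=float,
    padding=[3 / 2, 3 / 2, 3 / 2],
    transforms={(2,): (dct, idct)},
)

u = newDistArray(fft, forward_output=False)      # (padded) physical-space array
u_hat = newDistArray(fft, forward_output=True)   # spectral-space array
u[:] = np.random.default_rng(0).random(u.shape)
u_hat = fft.forward(u, u_hat)
u = fft.backward(u_hat, u)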

Finally, the greater goal was to do 3D Rayleigh-Benard and this is also part of this PR, including tests for simple configurations where the result of the solver and right hand side evaluation are known.

It's a large PR and I do apologize. But I didn't see a way of merging the refactor in smaller chunks.

@pancetta
Member

pancetta commented Jul 3, 2025

Not quite sure why the mirror fails.

Member

@tlunet tlunet left a comment

Looks good in general. Just a few comments here and there, if you can address those.

u_pad = self.itransform(u_hat, padding=padding).real

fexpl_pad = self.xp.zeros_like(u_pad)
fexpl_pad[iu][:] = -(u_pad[iu] * Dx_u_pad[iu] + u_pad[iv] * Dy_u_pad[iu] + u_pad[iw] * Dz_u_pad[iu])
Member

Do you really need the [:] at the end?

Also, try to use np.copyto instead of the = sign; it can be faster in some situations (and does exactly the same thing).

Finally, try to decompose operations on big vectors, e.g. like:

out = a*u1 + b*u2 + c*u3

into

np.copyto(out, u1)
out *= a
out += b*u2
out += c*u3

Yes, it's a bit cumbersome, but it avoids allocating a lot of new arrays during the computation, which can be expensive in both computation time and memory consumption. And you can still keep the non-optimized, more readable code commented out on top.
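
For concreteness, a small self-contained version of that pattern (the variable names and sizes are illustrative, not taken from the PR):

import numpy as np

n = 1_000_000
u1, u2, u3 = (np.random.rand(n) for _ in range(3))
a, b, c = 2.0, -1.0, 0.5
out = np.empty_like(u1)  # pre-allocated once, reused across calls

# readable but allocation-heavy version:
# out = a * u1 + b * u2 + c * u3

# in-place version: only b * u2 and c * u3 still create temporaries
np.copyto(out, u1)
out *= a
out += b * u2
out += c * u3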

Remember that every time you do an operation that is not in-place with numpy arrays, you implicitly allocate a temporary one. Ideally, the eval_f method should only work with pre-allocated temporary arrays, trying to minimize the required number of temporaries, and only do in-place operations or copies. I suspect this has a big impact on performance for large problem sizes.

PS: this comment also applies when you do matrix-vector multiplications. If possible, try to use functions that write the result into an existing out array, as can be done with the np.dot function.
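
For example (again just a sketch with made-up sizes):

import numpy as np

A = np.random.rand(500, 500)
x = np.random.rand(500)
y = np.empty(500)

# y = A @ x          # allocates a new result array on every call
np.dot(A, x, out=y)  # writes the product into the pre-allocated y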

Contributor Author

Sounds like a good idea. Will do!

Contributor Author

I did some memory profiling, and it turned out that storing big matrices of size (5N^3)^2, with one entry for every pair of components and points in space, requires loads of memory. I was storing a few of these because it made computations, in particular on GPU, faster. However, if the code is faster in theory but we can't fit it on the device, we gain nothing. So I removed all unnecessary big matrices and apply smaller ones more often instead.
Also, I implemented some of the optimizations suggested by Thibaut above.
Each of these big matrices was roughly the size of 8 solution-sized objects kept in memory, and I removed 5 of them, which should have a big impact. I'll have a look at the solver and then see how much more I can now fit into memory.
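
To put rough numbers on that (purely illustrative arithmetic, assuming 5 components, N = 64 points per direction and complex128 entries; not measurements from the PR):

N, ncomponents, bytes_per_entry = 64, 5, 16
solution = ncomponents * N**3 * bytes_per_entry  # one solution-sized object
big_matrix = 8 * solution                        # each big matrix ~ 8 solution objects
print(f"solution: {solution / 2**20:.0f} MiB, big matrix: {big_matrix / 2**20:.0f} MiB")
# removing 5 such matrices frees on the order of 40 solution-sized arrays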

@pancetta
Member

pancetta commented Jul 7, 2025

Monodomain project fixed in #566 and upstream

@pancetta
Member

pancetta commented Jul 8, 2025

Anything else here? Ready to merge? Or do you want to modify this further?

@tlunet
Member

tlunet commented Jul 8, 2025

Let's merge this for now, and eventually improve later. The PR is already big enough 😅

@pancetta pancetta merged commit 81495ee into Parallel-in-Time:master Jul 8, 2025
50 of 51 checks passed
@brownbaerchen brownbaerchen deleted the merge_refactor branch July 8, 2025 14:04